Abstract

Time series prediction covers a vast field of everyday statistical
applications in medical, environmental and economic domains. In this
paper we develop nonparametric prediction strategies based on the
combination of a set of "experts" and show the universal consistency
of these strategies under a minimum of conditions. We perform an
in-depth analysis of real-world data sets and show that these
nonparametric strategies are more flexible and faster than ARMA methods,
and generally outperform them in terms of normalized cumulative
prediction error.

Index Terms — Time series, sequential prediction, universal consis- 
tency, kernel estimation, nearest neighbor estimation, generalized lin- 
ear estimates. 
1 Introduction



The problem of time series analysis and prediction has a long and rich history,
probably dating back to the pioneering work of Yule in 1927 [30]. The
application scope is vast, as time series modeling is routinely employed across
the entire and diverse range of applied statistics, including problems in
genetics, medical diagnoses, air pollution forecasting, machine condition
monitoring, financial investments, marketing and econometrics. Most of the
research activity until the 1970s was concerned with parametric approaches to
the problem, whereby a simple, usually linear model is fitted to the data (for a
comprehensive account we refer the reader to the monograph of Brockwell
and Davis [2]). While many appealing mathematical properties of the
parametric paradigm have been established, it has become clear over the years
that the limitations of the approach may be rather severe, essentially due
to the overly rigid constraints imposed on the processes. One of the
more promising ways to overcome this problem has been the extension of
classic nonparametric methods to the time series framework (see for example
Gyorfi, Hardle, Sarda and Vieu [16] and Bosq for a review and references).

Interestingly, related schemes have been proposed in the context of
sequential investment strategies for financial markets. Sequential investment
strategies are allowed to use information about the market collected from the
past and determine, at the beginning of a trading period, a portfolio, that
is, a way to distribute the current capital among the available assets. Here,
the goal of the investor is to maximize their wealth in the long run, without
knowing the underlying distribution generating the stock prices. For more
information on this subject, we refer the reader to Algoet [1]; Gyorfi and
Schafer; Gyorfi, Lugosi and Udina; and Gyorfi, Udina and Walk.

The present paper is devoted to the nonparametric problem of sequential
prediction of real-valued sequences, which we do not require to
satisfy the classical statistical assumptions for bounded, autoregressive or
Markovian processes. Indeed, our goal is to show powerful consistency results
under a strict minimum of conditions. To fix the context, we suppose that
at each time instant $n = 1, 2, \ldots$, the statistician (also called the predictor
hereafter) is asked to guess the next outcome $y_n$ of a sequence of real numbers
$y_1, y_2, \ldots$ with knowledge of the past $y_1^{n-1} = (y_1, \ldots, y_{n-1})$ (where $y_1^0$ denotes
the empty string) and the side information vectors $x_1^n = (x_1, \ldots, x_n)$, where
$x_n \in \mathbb{R}^d$. In other words, adopting the perspective of on-line learning, the
elements $y_0, y_1, y_2, \ldots$ and $x_1, x_2, \ldots$ are revealed one at a time, in order,
beginning with $(x_1, y_0), (x_2, y_1), \ldots$, and the predictor's estimate of $y_n$ at
time $n$ is based on the strings $y_1^{n-1}$ and $x_1^n$. Formally, the strategy of the
predictor is a sequence $g = \{g_n\}_{n=1}^{\infty}$ of forecasting functions

$$g_n : (\mathbb{R}^d)^n \times \mathbb{R}^{n-1} \to \mathbb{R},$$

and the prediction formed at time $n$ is just $g_n(x_1^n, y_1^{n-1})$.

Throughout the paper we will suppose that $(x_1, y_1), (x_2, y_2), \ldots$ are
realizations of random variables $(X_1, Y_1), (X_2, Y_2), \ldots$ such that the process
$\{(X_n, Y_n)\}_{-\infty}^{\infty}$ is jointly stationary and ergodic.

After $n$ time instants, the (normalized) cumulative squared prediction
error on the strings $X_1^n$ and $Y_1^n$ is

$$L_n(g) = \frac{1}{n} \sum_{t=1}^{n} \left( g_t(X_1^t, Y_1^{t-1}) - Y_t \right)^2 .$$
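As an illustration of the definition above, the following minimal sketch computes the normalized cumulative squared error of an arbitrary strategy in the on-line protocol: at time $t$ the strategy sees $x_1^t$ and $y_1^{t-1}$, but not $y_t$. The names `cumulative_squared_error` and `last_value` are hypothetical and do not come from the paper; the "repeat the last outcome" strategy is only a toy example.

```python
from typing import Callable, Sequence


def cumulative_squared_error(
    predict: Callable[[Sequence[float], Sequence[float]], float],
    xs: Sequence[float],
    ys: Sequence[float],
) -> float:
    """Return L_n(g) = (1/n) * sum_t (g_t(x_1^t, y_1^{t-1}) - y_t)^2."""
    n = len(ys)
    total = 0.0
    for t in range(1, n + 1):
        # The predictor sees x_1..x_t and y_1..y_{t-1}, never y_t itself.
        guess = predict(xs[:t], ys[:t - 1])
        total += (guess - ys[t - 1]) ** 2
    return total / n


def last_value(xs, past_ys):
    """Toy strategy (hypothetical): repeat the last outcome, 0 at the start."""
    return past_ys[-1] if past_ys else 0.0


ys = [1.0, 2.0, 2.0, 4.0]
xs = [0.0] * len(ys)  # side information, unused by this toy strategy
loss = cumulative_squared_error(last_value, xs, ys)  # (1 + 1 + 0 + 4) / 4
```

Any of the experts or aggregated strategies discussed later can be plugged in as `predict`, provided it respects the same information constraint.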

Ideally, the goal is to make $L_n(g)$ small. There is, however, a fundamental
limit for the predictability of the sequence, which is determined by a result
of Algoet: for any prediction strategy $g$ and jointly stationary ergodic
process $\{(X_n, Y_n)\}_{-\infty}^{\infty}$,

$$\liminf_{n \to \infty} L_n(g) \ge L^* \quad \text{almost surely,} \tag{1}$$

where

$$L^* = E\left\{ \left( Y_0 - E\{ Y_0 \mid X_{-\infty}^{0}, Y_{-\infty}^{-1} \} \right)^2 \right\}$$

is the minimal mean squared error of any prediction for the value of $Y_0$
based on the infinite past observation sequences $Y_{-\infty}^{-1} = (\ldots, Y_{-2}, Y_{-1})$ and
$X_{-\infty}^{0} = (\ldots, X_{-1}, X_0)$. Generally, we cannot hope to design a strategy
whose prediction error exactly achieves the lower bound $L^*$. Rather, we
require that $L_n(g)$ gets arbitrarily close to $L^*$ as $n$ grows. This motivates
the following definition:

Definition 1.1 A prediction strategy $g$ is called universally consistent with
respect to a class $\mathcal{C}$ of stationary and ergodic processes $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ if, for
each process in the class,

$$\lim_{n \to \infty} L_n(g) = L^* \quad \text{almost surely.}$$

Thus, universally consistent strategies asymptotically achieve the best
possible loss for all processes in the class. Algoet [1] and Morvai, Yakowitz
and Gyorfi proved that there exist universally consistent strategies with
respect to the class $\mathcal{C}$ of all bounded, stationary and ergodic processes.
However, the prediction algorithms discussed in these papers are either very
complex or have an unreasonably slow rate of convergence, even for
well-behaved processes. Building on the methodology developed in recent years
for prediction of individual sequences (see Cesa-Bianchi and Lugosi [8] for
a survey and references), Gyorfi and Lugosi introduced in [18] a
histogram-based prediction strategy which is "simple" and yet universally consistent
with respect to the class $\mathcal{C}$. A similar result was also derived independently
by Nobel [23]. Roughly speaking, both methods consider several partitioning
estimates (called experts in this context) and combine them at time $n$
according to their past performance. For this, a probability distribution on
the set of experts is generated, where a "good" expert has relatively large
weight, and the average of all experts' predictions is taken with respect to
this distribution.

The purpose of this paper is to further investigate nonparametric expert- 
oriented strategies for unbounded time series prediction. With this aim in 
mind, in Section 2.1 we briefly recall the histogram-based prediction strategy 
initiated in [18], which was recently extended to unbounded processes by
Gyorfi and Ottucsak [20]. In Sections 2.2 and 2.3 we offer two "more flexible"
strategies, called respectively kernel and nearest neighbor-based prediction 
strategies, and state their universal consistency with respect to the class of 
all (non-necessarily bounded) stationary and ergodic processes with finite 
fourth moment. In Section 2.4 we consider as an alternative a prediction 
strategy based on combining generalized linear estimates. In Section 2.5 
we use the techniques of the previous section to give a simpler prediction 
strategy for stationary Gaussian ergodic processes. Extensive experimental 
results based on real-life data sets are discussed in Section 3, and proofs of 
the main results are given in Section 4. 

2 Universally consistent prediction strategies 

2.1 Histogram-based prediction strategy 

In this section, we briefly describe the histogram-based prediction scheme due
to Gyorfi and Ottucsak [20] for unbounded stationary and ergodic sequences.
The strategy is defined at each time instant as a convex combination of
elementary predictors (the so-called experts), where the weighting coefficients
depend on the past performance of each elementary predictor. To be more
precise, we first define an infinite array of experts $h^{(k,\ell)}$, $k, \ell = 1, 2, \ldots$ as
follows. Let $\mathcal{P}_\ell = \{A_{\ell,j}, j = 1, 2, \ldots, m_\ell\}$ be a sequence of finite partitions of
$\mathbb{R}^d$, and let $\mathcal{Q}_\ell = \{B_{\ell,j}, j = 1, 2, \ldots, m'_\ell\}$ be a sequence of finite partitions of
$\mathbb{R}$. Introduce the corresponding quantizers:

$$F_\ell(x) = j, \quad \text{if } x \in A_{\ell,j}$$

and

$$G_\ell(y) = j, \quad \text{if } y \in B_{\ell,j}.$$

To lighten notation a bit, for any $n$ and $x_1^n \in (\mathbb{R}^d)^n$, we write $F_\ell(x_1^n)$ for the
sequence $F_\ell(x_1), \ldots, F_\ell(x_n)$ and similarly, for $y_1^n \in \mathbb{R}^n$ we write $G_\ell(y_1^n)$ for
the sequence $G_\ell(y_1), \ldots, G_\ell(y_n)$.

The sequence of experts $h^{(k,\ell)}$, $k, \ell = 1, 2, \ldots$ is defined as follows. Let
$J_n^{(k,\ell)}$ be the locations of the matches of the last seen strings $x_{n-k}^n$ of length $k+1$
and $y_{n-k}^{n-1}$ of length $k$ in the past according to the quantizer with parameters
$k$ and $\ell$:

$$J_n^{(k,\ell)} = \left\{ k < t < n : F_\ell(x_{t-k}^t) = F_\ell(x_{n-k}^n),\ G_\ell(y_{t-k}^{t-1}) = G_\ell(y_{n-k}^{n-1}) \right\},$$

and introduce the truncation function

$$T_a(z) = \begin{cases} a & \text{if } z > a; \\ z & \text{if } |z| \le a; \\ -a & \text{if } z < -a. \end{cases}$$

Now define the elementary predictor $h_n^{(k,\ell)}$ by

$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta, \ell\}} \left( \frac{\sum_{t \in J_n^{(k,\ell)}} y_t}{|J_n^{(k,\ell)}|} \right), \quad n > k + 1,$$

where $0/0$ is defined to be $0$ and

$$0 < \delta < 1/8.$$

Here and throughout, for any finite set $J$, the notation $|J|$ stands for the
size of $J$. We note that the expert $h_n^{(k,\ell)}$ can be interpreted as a (truncated)
histogram regression function estimate drawn in $(\mathbb{R}^d)^{k+1} \times \mathbb{R}^k$ (Gyorfi, Kohler,
Krzyzak and Walk [17]).
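The construction of a single histogram expert can be sketched in a few lines. The sketch below is only illustrative and makes several assumptions not found in the paper: the side information is scalar ($d = 1$), the partitions are uniform grids of cell width `width_x` and `width_y`, and the function names are hypothetical. It quantizes the past, collects the time indices whose quantized context matches the current one, and returns the truncated average of the outcomes that followed.

```python
import math


def truncate(z, a):
    """Truncation function T_a: clip z to the interval [-a, a]."""
    return max(-a, min(a, z))


def quantize(v, width):
    """Hypothetical quantizer: index of the uniform cell of size `width` containing v."""
    return math.floor(v / width)


def histogram_expert(xs, ys_past, k, ell, width_x, width_y, delta=0.1):
    """Predict y_n from x_1..x_n and y_1..y_{n-1} (scalar side information)."""
    n = len(xs)
    assert len(ys_past) == n - 1
    if n <= k + 1:
        return 0.0
    qx = [quantize(v, width_x) for v in xs]
    qy = [quantize(v, width_y) for v in ys_past]
    # Quantized context at time n: x_{n-k}^n (k+1 values) and y_{n-k}^{n-1} (k values).
    ctx_x = qx[n - k - 1:n]
    ctx_y = qy[n - k - 1:n - 1]
    matches = [
        ys_past[t - 1]                # y_t for a matching (1-based) time t
        for t in range(k + 1, n)      # k < t < n
        if qx[t - k - 1:t] == ctx_x and qy[t - k - 1:t - 1] == ctx_y
    ]
    if not matches:                   # 0/0 is defined to be 0
        return 0.0
    return truncate(sum(matches) / len(matches), min(n ** delta, ell))


ys_past = [1.0, -1.0] * 4   # y_1..y_8 of an alternating sequence
xs = [0.0] * 9              # uninformative side information x_1..x_9
pred = histogram_expert(xs, ys_past, k=1, ell=5, width_x=0.5, width_y=0.5)
```

On the alternating toy sequence, every past occurrence of the context $y_8 = -1$ was followed by $+1$, so the expert predicts $1$ (the truncation level $\min\{9^{0.1}, 5\}$ is not active here).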

The proposed prediction algorithm proceeds with an exponential weighting
average method. Formally, let $\{q_{k,\ell}\}$ be a probability distribution on the
set of all pairs $(k, \ell)$ of positive integers such that for all $k$ and $\ell$, $q_{k,\ell} > 0$.
Fix a learning parameter $\eta_n > 0$, and define the weights

$$w_{k,\ell,n} = q_{k,\ell}\, e^{-\eta_n (n-1) L_{n-1}(h^{(k,\ell)})}$$

and their normalized values

$$p_{k,\ell,n} = \frac{w_{k,\ell,n}}{\sum_{i,j=1}^{\infty} w_{i,j,n}} .$$

The prediction strategy $g$ at time $n$ is defined by

$$g_n(x_1^n, y_1^{n-1}) = \sum_{k,\ell=1}^{\infty} p_{k,\ell,n}\, h_n^{(k,\ell)}(x_1^n, y_1^{n-1}), \quad n = 1, 2, \ldots$$
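The exponential weighting step can be sketched as follows for a finite pool of experts (the experiments of Section 3 also restrict themselves to a finite grid). This is a minimal sketch under stated assumptions: the experts are passed as plain functions of the past outcomes only, the helper names `aggregate` and `run_mixture` are hypothetical, and the learning parameter is taken as $\eta_n = 1/\sqrt{n}$. Note that $(n-1) L_{n-1}(h)$ is simply the expert's cumulative squared loss up to time $n-1$.

```python
import math


def aggregate(expert_preds, cum_losses, prior, eta):
    """Mix predictions with weights w_i = q_i * exp(-eta * cumulative_loss_i)."""
    # Subtracting the smallest loss leaves the normalized weights unchanged
    # and prevents all exponentials from underflowing to zero.
    base = min(cum_losses)
    w = [q * math.exp(-eta * (c - base)) for q, c in zip(prior, cum_losses)]
    total = sum(w)
    p = [wi / total for wi in w]
    return sum(pi * h for pi, h in zip(p, expert_preds)), p


def run_mixture(experts, ys, prior):
    """Run the aggregated strategy on-line and return its predictions."""
    cum = [0.0] * len(experts)          # (n-1) * L_{n-1}(h) for each expert
    preds = []
    for n in range(1, len(ys) + 1):
        past = ys[:n - 1]
        hs = [h(past) for h in experts]
        g, _ = aggregate(hs, cum, prior, eta=1.0 / math.sqrt(n))
        preds.append(g)
        y = ys[n - 1]                   # outcome revealed after predicting
        cum = [c + (h - y) ** 2 for c, h in zip(cum, hs)]
    return preds


# Two toy experts (hypothetical): always predict 0, always predict 1.
experts = [lambda past: 0.0, lambda past: 1.0]
preds = run_mixture(experts, [1.0] * 20, prior=[0.5, 0.5])
```

With a uniform prior the first prediction is the plain average $0.5$; as the constant-0 expert accumulates loss, the mixture moves towards the constant-1 expert.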

It is proved in [20] that this scheme is universally consistent with respect 
to the class of all (non-necessarily bounded) stationary and ergodic processes 
with finite fourth moment, as stated in the following theorem. Here and 
throughout the document, $\| \cdot \|$ denotes the Euclidean norm.

Theorem 2.1 (Gyorfi and Ottucsak [20]) Assume that

(a) The sequence of partitions $\mathcal{P}_\ell$ is nested, that is, any cell of $\mathcal{P}_{\ell+1}$ is a
subset of a cell of $\mathcal{P}_\ell$, $\ell = 1, 2, \ldots$;

(b) The sequence of partitions $\mathcal{Q}_\ell$ is nested;

(c) The sequence of partitions $\mathcal{P}_\ell$ is asymptotically fine, i.e., if

$$\mathrm{diam}(A) = \sup_{x, y \in A} \|x - y\|$$

denotes the diameter of a set, then for each sphere $S$ centered at the
origin

$$\lim_{\ell \to \infty} \max_{j : A_{\ell,j} \cap S \neq \emptyset} \mathrm{diam}(A_{\ell,j}) = 0;$$

(d) The sequence of partitions $\mathcal{Q}_\ell$ is asymptotically fine.

Then, if we choose the learning parameter $\eta_n$ of the algorithm as

$$\eta_n = \frac{1}{\sqrt{n}},$$

the histogram-based prediction scheme $g$ defined above is universally
consistent with respect to the class of all jointly stationary and ergodic processes
such that

$$E\{Y_0^4\} < \infty.$$

The idea of combining a collection of concurrent estimates was originally
developed in a non-stochastic context for on-line sequential prediction from
deterministic sequences (see Cesa-Bianchi and Lugosi [8] for a comprehensive
introduction). Following the terminology of the prediction literature, the
combination of different procedures is sometimes termed aggregation in the
stochastic context. The overall goal is always the same: use aggregation to
improve prediction. For a recent review and an updated list of references,
see Bunea and Nobel [6] and Bunea, Tsybakov and Wegkamp [7].



2.2 Kernel-based prediction strategies 

We introduce in this section a class of kernel-based prediction strategies for 
(non-necessarily bounded) stationary and ergodic sequences. The main ad- 
vantage of this approach in contrast to the histogram-based strategy is that 
it replaces the rigid discretization of the past appearances by more flexible 
rules. This also often leads to faster algorithms in practical applications. 

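To simplify the notation, we start with the simple "moving-window"
scheme, corresponding to a uniform kernel function, and treat the general
case briefly later. Just like before, we define an array of experts $h^{(k,\ell)}$, where
$k$ and $\ell$ are positive integers. We associate to each pair $(k, \ell)$ two radii $r_{k,\ell} > 0$
and $r'_{k,\ell} > 0$ such that, for any fixed $k$,

$$\lim_{\ell \to \infty} r_{k,\ell} = 0, \tag{2}$$

and

$$\lim_{\ell \to \infty} r'_{k,\ell} = 0. \tag{3}$$

Finally, let the location of the matches be

$$J_n^{(k,\ell)} = \left\{ k < t < n : \|x_{t-k}^t - x_{n-k}^n\| \le r_{k,\ell},\ \|y_{t-k}^{t-1} - y_{n-k}^{n-1}\| \le r'_{k,\ell} \right\}.$$

Then the elementary expert $h_n^{(k,\ell)}$ at time $n$ is defined by

$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta, \ell\}} \left( \frac{\sum_{t \in J_n^{(k,\ell)}} y_t}{|J_n^{(k,\ell)}|} \right), \quad n > k + 1, \tag{4}$$

where $0/0$ is defined to be $0$ and

$$0 < \delta < 1/8.$$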

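The pool of experts is mixed the same way as in the case of the
histogram-based strategy. That is, letting $\{q_{k,\ell}\}$ be a probability distribution over the
set of all pairs $(k, \ell)$ of positive integers such that $q_{k,\ell} > 0$ for all $k$ and $\ell$, for
$\eta_n > 0$, we define the weights

$$w_{k,\ell,n} = q_{k,\ell}\, e^{-\eta_n (n-1) L_{n-1}(h^{(k,\ell)})}$$

together with their normalized values

$$p_{k,\ell,n} = \frac{w_{k,\ell,n}}{\sum_{i,j=1}^{\infty} w_{i,j,n}} . \tag{5}$$

The general prediction scheme $g_n$ at time $n$ is then defined by weighting
the experts according to their past performance and the initial distribution
$\{q_{k,\ell}\}$:

$$g_n(x_1^n, y_1^{n-1}) = \sum_{k,\ell=1}^{\infty} p_{k,\ell,n}\, h_n^{(k,\ell)}(x_1^n, y_1^{n-1}), \quad n = 1, 2, \ldots$$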

Theorem 2.2 Denote by $\mathcal{C}$ the class of all jointly stationary and ergodic
processes $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ such that $E\{Y_0^4\} < \infty$. Choose the learning parameter
$\eta_n$ of the algorithm as

$$\eta_n = \frac{1}{\sqrt{n}},$$

and suppose that (2) and (3) are verified. Then the moving-window-based
prediction strategy defined above is universally consistent with respect to the
class $\mathcal{C}$.

The proof of Theorem 2.2 is in Section 4. This theorem may be extended
to a more general class of kernel-based strategies, as introduced in the next
remark.

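Remark 2.1 (General kernel function) Define a kernel function as
any map $K : \mathbb{R}_+ \to \mathbb{R}_+$. The kernel-based strategy parallels the
moving-window scheme defined above, with the only difference that in definition (4)
of the elementary strategy, the regression function estimate is replaced by

$$\frac{\sum_{k < t < n} y_t\, K\!\left( \|x_{t-k}^t - x_{n-k}^n\| / r_{k,\ell} \right) K\!\left( \|y_{t-k}^{t-1} - y_{n-k}^{n-1}\| / r'_{k,\ell} \right)}{\sum_{k < t < n} K\!\left( \|x_{t-k}^t - x_{n-k}^n\| / r_{k,\ell} \right) K\!\left( \|y_{t-k}^{t-1} - y_{n-k}^{n-1}\| / r'_{k,\ell} \right)} .$$

Observe that if $K$ is the naive kernel $K(x) = 1_{\{x \le 1\}}$ (where $1$ denotes the
indicator function and $x \in \mathbb{R}_+$), we recover the moving-window strategy
discussed above. Typical nonuniform kernels assign a smaller weight to the
observations $x_{t-k}^t$ and $y_{t-k}^{t-1}$ whose distance from $x_{n-k}^n$ and $y_{n-k}^{n-1}$ is larger. Such
kernels promise a better prediction of the local structure of the conditional
distribution.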
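A kernel expert can be sketched as a weighted average over past contexts. The code below is an illustrative sketch only: it assumes scalar side information, hypothetical function names, and it takes the product-of-kernels form of the estimate as its starting point. With the default naive kernel it reduces to the moving-window expert; passing a smooth kernel such as $u \mapsto e^{-u^2}$ down-weights distant contexts instead of discarding them.

```python
import math


def naive_kernel(u):
    """Naive (uniform) kernel K(u) = 1 if u <= 1, else 0."""
    return 1.0 if u <= 1.0 else 0.0


def dist(a, b):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def kernel_expert(xs, ys_past, k, ell, r_x, r_y, kernel=naive_kernel, delta=0.1):
    """Predict y_n from x_1..x_n and y_1..y_{n-1} with a kernel-weighted average."""
    n = len(xs)
    if n <= k + 1:
        return 0.0
    ctx_x, ctx_y = xs[n - k - 1:n], ys_past[n - k - 1:n - 1]
    num = den = 0.0
    for t in range(k + 1, n):                       # k < t < n
        w = (kernel(dist(xs[t - k - 1:t], ctx_x) / r_x)
             * kernel(dist(ys_past[t - k - 1:t - 1], ctx_y) / r_y))
        num += w * ys_past[t - 1]                   # weight times y_t
        den += w
    if den == 0.0:                                  # 0/0 is defined to be 0
        return 0.0
    bound = min(n ** delta, ell)
    return max(-bound, min(bound, num / den))       # truncation T_bound


ys_past = [1.0, -1.0] * 4
xs = [0.0] * 9
window_pred = kernel_expert(xs, ys_past, k=1, ell=5, r_x=1.0, r_y=0.5)
smooth_pred = kernel_expert(xs, ys_past, k=1, ell=5, r_x=1.0, r_y=0.5,
                            kernel=lambda u: math.exp(-u * u))
```

On this toy sequence the moving-window expert predicts exactly $1$, while the smooth kernel gives a value marginally below $1$ because the non-matching contexts still receive a tiny positive weight.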

2.3 Nearest neighbor-based prediction strategy

This strategy is more robust than the kernel strategy, and
thus also than the histogram strategy. This is because it does
not suffer from the scaling problems of histogram and kernel-based strategies,
where the quantizer and the radius have to be carefully chosen to obtain
"good" performance.

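To introduce the strategy, we start again by defining an infinite array
of experts $h^{(k,\ell)}$, where $k$ and $\ell$ are positive integers. Just like before, $k$ is
the length of the past observation vectors being scanned by the elementary
expert and, for each $\ell$, choose $p_\ell \in (0, 1)$ such that

$$\lim_{\ell \to \infty} p_\ell = 0, \tag{6}$$

and set

$$\bar{\ell} = \lfloor p_\ell n \rfloor$$

(where $\lfloor \cdot \rfloor$ is the floor function). At time $n$, for fixed $k$ and $\ell$ ($n > k + \bar{\ell} + 1$), the
expert searches for the $\bar{\ell}$ nearest neighbors (NN) of the last seen observation
$x_{n-k}^n$ and $y_{n-k}^{n-1}$ in the past and predicts accordingly. More precisely, let

$$J_n^{(k,\ell)} = \left\{ k < t < n : (x_{t-k}^t, y_{t-k}^{t-1}) \text{ is among the } \bar{\ell} \text{ NN of } (x_{n-k}^n, y_{n-k}^{n-1}) \text{ in } (x_1^{k+1}, y_1^k), \ldots, (x_{n-k-1}^{n-1}, y_{n-k-1}^{n-2}) \right\}$$

and introduce the elementary predictor

$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta, \ell\}} \left( \frac{\sum_{t \in J_n^{(k,\ell)}} y_t}{|J_n^{(k,\ell)}|} \right)$$

if the sum is non-void, and $0$ otherwise. Next, set

$$0 < \delta < 1/8.$$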
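The nearest neighbor expert differs from the kernel expert only in how the matching set is selected: instead of a fixed radius, it keeps a fixed fraction of the closest past contexts. The following sketch assumes scalar side information, uses a hypothetical parameter `frac` for $p_\ell$, and breaks distance ties by earlier time index, which is an assumption made here for definiteness (the theorem below sidesteps ties through a continuity condition).

```python
import math


def nn_expert(xs, ys_past, k, ell, frac, delta=0.1):
    """Predict y_n by averaging the outcomes that followed the nearest past contexts."""
    n = len(xs)
    n_neighbors = math.floor(frac * n)              # number of neighbors used
    if n_neighbors < 1 or n <= k + n_neighbors + 1:
        return 0.0
    ctx = list(xs[n - k - 1:n]) + list(ys_past[n - k - 1:n - 1])
    scored = []
    for t in range(k + 1, n):                       # candidate times k < t < n
        cand = list(xs[t - k - 1:t]) + list(ys_past[t - k - 1:t - 1])
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(cand, ctx)))
        scored.append((d, t))
    scored.sort()                                   # ties: earlier t first (assumption)
    nearest = scored[:n_neighbors]
    avg = sum(ys_past[t - 1] for _, t in nearest) / len(nearest)
    bound = min(n ** delta, ell)
    return max(-bound, min(bound, avg))             # truncation T_bound


ys_past = [1.0, -1.0] * 4
xs = [0.0] * 9
pred = nn_expert(xs, ys_past, k=1, ell=5, frac=0.3)   # floor(0.3 * 9) = 2 neighbors
```

Because the number of neighbors scales with $n$, no radius or bin width has to be tuned to the scale of the data.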

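Finally, the experts are mixed as before: starting from an initial probability
distribution $\{q_{k,\ell}\}$, the aggregation scheme is

$$g_n(x_1^n, y_1^{n-1}) = \sum_{k,\ell=1}^{\infty} p_{k,\ell,n}\, h_n^{(k,\ell)}(x_1^n, y_1^{n-1}), \quad n = 1, 2, \ldots,$$

where the probabilities $p_{k,\ell,n}$ are the normalized exponential weights defined
in Section 2.2.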

Theorem 2.3 Denote by $\mathcal{C}$ the class of all jointly stationary and ergodic
processes $\{(X_n, Y_n)\}_{-\infty}^{\infty}$ such that $E\{Y_0^4\} < \infty$. Choose the parameter $\eta_n$ of
the algorithm as

$$\eta_n = \frac{1}{\sqrt{n}},$$

and suppose that (6) is verified. Suppose also that for each vector $s$ the
random variable

$$\| (X_1^{k+1}, Y_1^k) - s \|$$

has a continuous distribution function. Then the nearest neighbor prediction
strategy defined above is universally consistent with respect to the class $\mathcal{C}$.

The proof is a combination of the proof of Theorem 2.2 and the technique
used in [22].



2.4 Generalized linear prediction strategy 

This section is devoted to an alternative way of defining a universal
predictor for stationary and ergodic processes. It is in effect an extension of the
approach presented in Gyorfi and Lugosi [18] to non-necessarily bounded
processes. Once again, we apply the method described in the previous sections to
combine elementary predictors, but now we use elementary predictors which
are generalized linear predictors. More precisely, we define an infinite array
of elementary experts $h^{(k,\ell)}$, $k, \ell = 1, 2, \ldots$ as follows. Let $\{\phi_j^{(k)}\}_{j=1}^{\ell}$ be
real-valued functions defined on $(\mathbb{R}^d)^{k+1} \times \mathbb{R}^k$. The elementary predictor $h_n^{(k,\ell)}$
generates a prediction of form

$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta, \ell\}} \left( \sum_{j=1}^{\ell} c_{n,j}\, \phi_j^{(k)}(x_{n-k}^n, y_{n-k}^{n-1}) \right),$$

where the coefficients $c_{n,j}$ are calculated according to the past observations
$x_1^n$, $y_1^{n-1}$, and

$$0 < \delta < 1/8.$$

Formally, the coefficients $c_{n,j}$ are defined as the real numbers which minimize
the criterion

$$\sum_{t=k+1}^{n-1} \left( \sum_{j=1}^{\ell} c_j\, \phi_j^{(k)}(x_{t-k}^t, y_{t-k}^{t-1}) - y_t \right)^2 \tag{7}$$

if $n > k + 1$, and the all-zero vector otherwise. It can be shown using
a recursive technique (see e.g., Tsypkin [29], Gyorfi [15], Singer and Feder
[22], and Gyorfi and Lugosi [18]) that the $c_{n,j}$ can be calculated with small
computational complexity.

The experts are mixed via an exponential weighting, which is defined the
same way as earlier. Thus, the aggregated prediction scheme is

$$g_n(x_1^n, y_1^{n-1}) = \sum_{k,\ell=1}^{\infty} p_{k,\ell,n}\, h_n^{(k,\ell)}(x_1^n, y_1^{n-1}), \quad n = 1, 2, \ldots,$$

where the $p_{k,\ell,n}$ are the normalized exponential weights defined in Section 2.2.

Combining the proof of Theorem 2.2 and the proof of Theorem 2 in [18]
leads to the following result:

Theorem 2.4 Suppose that $|\phi_j^{(k)}| \le 1$ and, for any fixed $k$, suppose that the
set

$$\left\{ \sum_{j=1}^{\ell} c_j\, \phi_j^{(k)} ;\ (c_1, \ldots, c_\ell),\ \ell = 1, 2, \ldots \right\}$$

is dense in the set of continuous functions of $d(k+1) + k$ variables. Then the
generalized linear prediction strategy defined above is universally consistent
with respect to the class of all jointly stationary and ergodic processes such
that

$$E\{Y_0^4\} < \infty.$$

We give a sketch of the proof of Theorem 2.4 in Section 4.

2.5 Prediction of Gaussian processes

We consider in this section the classical problem of Gaussian time series
prediction (cf. Brockwell and Davis [2]). In this context, parametric models
based on distribution assumptions and structural conditions such as AR(p),
MA(q), ARMA(p, q) and ARIMA(p, d, q) are usually fitted to the data (cf.
Gerencser and Rissanen [13], Gerencser [11], [12], Goldenshluger and Zeevi
[14]). However, in the spirit of modern nonparametric inference, we try to
avoid such restrictions on the process structure. Thus, we only assume that
we observe a string realization $y_1^{n-1}$ of a zero-mean, stationary and ergodic
Gaussian process $\{Y_n\}_{-\infty}^{\infty}$, and try to predict $y_n$, the value of the process at
time $n$. Note that there are no side information vectors $x_1^n$ in this purely time
series prediction framework.

It is well known for Gaussian time series that the best predictor is a linear
function of the past:

$$E\{Y_n \mid Y_{n-1}, Y_{n-2}, \ldots\} = \sum_{j=1}^{\infty} c_j^* Y_{n-j},$$

where the $c_j^*$ minimize the criterion

$$E\left\{ \left( \sum_{j=1}^{\infty} c_j Y_{n-j} - Y_n \right)^2 \right\}.$$

Following Gyorfi and Lugosi [18], we extend the principle of generalized
linear estimates to the prediction of Gaussian time series by considering the
special case

$$\phi_j^{(k)}(y_{n-k}^{n-1}) = y_{n-j}\, 1_{\{1 \le j \le k\}},$$

i.e.,

$$\tilde{h}_n^{(k)}(y_1^{n-1}) = \sum_{j=1}^{k} c_{n,j}\, y_{n-j}.$$

Once again, the coefficients $c_{n,j}$ are calculated according to the past
observations $y_1^{n-1}$ by minimizing the criterion:

$$\sum_{t=k+1}^{n-1} \left( \sum_{j=1}^{k} c_j\, y_{t-j} - y_t \right)^2$$

if $n > k$, and the all-zero vector otherwise.
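The least-squares fit behind each autoregressive expert can be sketched with a standard solver. The block below is a minimal sketch, not the recursive low-complexity update mentioned in Section 2.4: it refits the order-$k$ coefficients from scratch with `numpy.linalg.lstsq` (which returns the minimum-norm solution when the system is rank deficient), then clips the prediction at level $\min\{n^\delta, k\}$. The function name and the choice `delta=0.1` are assumptions made for illustration.

```python
import numpy as np


def ar_expert(ys_past, k, delta=0.1):
    """Truncated least-squares AR(k) prediction of y_n from y_1..y_{n-1}."""
    y = np.asarray(ys_past, dtype=float)
    n = len(y) + 1
    if len(y) <= k:
        # No complete regression row is available yet: all-zero coefficients.
        return 0.0
    # Rows t = k+1..n-1 (1-based): regressors (y_{t-1}, ..., y_{t-k}), target y_t.
    rows = np.array([y[t - k - 1:t - 1][::-1] for t in range(k + 1, n)])
    targets = y[k:]
    coef, *_ = np.linalg.lstsq(rows, targets, rcond=None)
    raw = float(coef @ y[-k:][::-1])          # sum_j c_{n,j} * y_{n-j}
    bound = min(n ** delta, k)
    return max(-bound, min(bound, raw))       # truncation T_bound


# A noiseless AR(1) toy series y_t = 0.5 * y_{t-1}.
pred = ar_expert([0.8, 0.4, 0.2, 0.1], k=1)
```

On the noiseless toy series the fitted coefficient is $0.5$, so the expert predicts $0.05$. Refitting at every step costs more than a recursive update, which is acceptable for a sketch but not for long series.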

With respect to the combination of elementary experts $\tilde{h}^{(k)}$, Gyorfi and
Lugosi applied in [18] the so-called "doubling-trick", which means that the
time axis is segmented into exponentially increasing epochs and at the
beginning of each epoch the forecaster is reset.

In this section we propose a much simpler procedure which avoids in
particular the doubling-trick. To begin, we set

$$h_n^{(k)}(y_1^{n-1}) = T_{\min\{n^\delta, k\}} \left( \tilde{h}_n^{(k)}(y_1^{n-1}) \right),$$

where

$$0 < \delta < 1/8,$$

and combine these experts as before. Precisely, let $\{q_k\}$ be an arbitrary
probability distribution over the positive integers such that for all $k$, $q_k > 0$,
and for $\eta_n > 0$, define the weights

$$w_{k,n} = q_k\, e^{-\eta_n (n-1) L_{n-1}(h^{(k)})}$$

and their normalized values

$$p_{k,n} = \frac{w_{k,n}}{\sum_{i=1}^{\infty} w_{i,n}} .$$

The prediction strategy $g$ at time $n$ is defined by

$$g_n(y_1^{n-1}) = \sum_{k=1}^{\infty} p_{k,n}\, h_n^{(k)}(y_1^{n-1}), \quad n = 1, 2, \ldots$$

By combining the proof of Theorem 2.2 and Theorem 3 in [18], we obtain
the following result:

Theorem 2.5 The linear prediction strategy g defined above is universally 
consistent with respect to the class of all jointly stationary and ergodic zero- 
mean Gaussian processes. 

The following corollary shows that the strategy $g$ provides asymptotically
a good estimate of the regression function in the following sense:

Corollary 2.1 (Gyorfi and Ottucsak [20]) Under the conditions of
Theorem 2.5,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \left( E\{Y_t \mid Y_{-\infty}^{t-1}\} - g_t(Y_1^{t-1}) \right)^2 = 0 \quad \text{almost surely.}$$

Corollary 2.1 is expressed in terms of an almost sure Cesaro consistency.
It is an open problem to know whether there exists a prediction rule $g$ such
that

$$\lim_{n \to \infty} \left( E\{Y_n \mid Y_{-\infty}^{n-1}\} - g_n(Y_1^{n-1}) \right) = 0 \quad \text{almost surely} \tag{8}$$

for all stationary and ergodic Gaussian processes. Schafer [26] proved that,
under some conditions on the time series, the consistency (8) holds.



3 Experimental results and analyses 

We evaluated the performance of the histogram, moving-window kernel, NN 
and Gaussian process strategies on two real world data sets. Furthermore, 
we compared these performances to those of the standard ARMA family of 
methods on the same data sets. We show in particular that the four methods 
presented in this paper usually perform better than the best ARMA results, 
with respect to three different criteria. 

The two real-world time series we investigated were the monthly USA 
unemployment rate for January 1948 until March 2007 (710 points) and 
daily USA federal funds interest rate for 12 January 2003 until 21 March 
2007 (1200 points) respectively, extracted from the website economagic.com. 

Figure 1: Monthly percentage change in USA unemployment rate for January
1948 until March 2007.

Figure 2: Daily percentage change in USA federal funds interest rate for
12 January 2003 until 21 March 2007.

In order to remove first-order trends, we transformed these time series into
time series of percentage change compared to the previous month or day,
respectively. The resulting time series are shown in Figs. 1 and 2.
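The percentage-change transformation can be written in one line. The sketch below is illustrative: the function name and the toy levels are assumptions, and the real series would be read from the data files rather than typed in.

```python
def percentage_change(levels):
    """Percentage change of each observation relative to the previous one."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(levels, levels[1:])]


# Toy levels (not real data): 4.0 -> 5.0 is +25%, 5.0 -> 4.5 is -10%.
changes = percentage_change([4.0, 5.0, 4.5])
```

The transformed series is one point shorter than the original, and the transformation is undefined if a level equals zero.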

Before testing the four methods of the present paper alongside the ARMA
methods, we tested whether the resulting time series were trend/level
stationary using two standard tests, the KPSS test and the PP test.
For both series, the KPSS test did not reject the null hypothesis of
level stationarity at p = .01, .05 and .1, and the PP test (whose null
hypothesis is the existence of a unit root and whose alternative hypothesis
is level stationarity) rejected its null hypothesis
at p = .01, .05 and .1.

We remark that this means the ARIMA(p, d, q) family of models, which is richer
than ARMA(p, q), is unnecessary, or equivalently, that we need only consider
the family ARIMA(p, 0, q). In addition, the Gaussian process
method requires the normality of the data. Since the original data in both
data sets are discretized (and not very finely), the data, when
transformed into percentage changes, took only a small number of fixed values.
As a consequence, directly applying standard normality tests
gave curious results even when histograms of the data appeared to have
near-perfect Gaussian forms; however, adding small amounts of random noise to
the data allowed us to not systematically reject the hypothesis of normality.

Given each method and each time series $(y_1, \ldots, y_m)$ (here, $m = 710$ or
$1200$), for each $15 \le n \le m - 1$ we used the data $(y_1, \ldots, y_n)$ to predict the
value of $y_{n+1}$. We used three criteria to measure the quality of the overall
set of predictions. First, as described in the present paper, we calculated
the normalized cumulative squared prediction error $L_m$ (since we start with
$n = 15$ for practical reasons, this is almost but not exactly what has been
called $L_n$ until now). Secondly, we calculated $L_{50}$, the normalized cumulative
prediction error over only the last 50 predictions of the time series, in order to
see how the method was working after having learned nearly the whole time
series. Thirdly, since in practical situations we may want to predict only the
direction of change, we compared the direction (positive or negative) of the
last 50 predicted points with respect to each previous, known point, to the 50
real directions. This gave us the criterion $A_{50}$: the percentage of the directions
of the last 50 points correctly predicted.
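The three criteria can be computed from aligned lists of true values, predictions and previous known values. The sketch below is a plausible reading of the description above rather than the authors' code: the function name is hypothetical, and treating a zero change as "not positive" is an assumption, since the text does not say how ties are handled.

```python
def evaluation_criteria(actual, predicted, previous, tail=50):
    """Return (L_m, L_tail, A_tail) for aligned sequences.

    actual[i] is the true value, predicted[i] its forecast and previous[i]
    the last known value before it.
    """
    sq_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    l_m = sum(sq_errors) / len(sq_errors)
    tail_errors = sq_errors[-tail:]
    l_tail = sum(tail_errors) / len(tail_errors)
    triples = list(zip(actual, predicted, previous))[-tail:]
    # A direction is "up" if the value exceeds the previous known point;
    # a zero change counts as "not up" (assumption).
    hits = sum(1 for a, p, prev in triples if (a > prev) == (p > prev))
    a_tail = 100.0 * hits / len(triples)
    return l_m, l_tail, a_tail


# Toy data (not from the paper), with the tail shortened to 2 points.
actual = [1.0, 2.0, 3.0, 2.0]
previous = [0.0, 1.0, 2.0, 3.0]
predicted = [1.0, 1.5, 2.5, 3.5]
l_m, l_tail, a_tail = evaluation_criteria(actual, predicted, previous, tail=2)
```

In the toy example the last two predictions get one direction right and one wrong, so the directional accuracy is 50 percent.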

As in [19] and [22], for practical reasons we chose a finite grid of
experts: $k = 1, \ldots, K$ and $\ell = 1, \ldots, L$ for the histogram, kernel and
NN strategies, fixing K = 5 and L = 10. For the histogram strat- 
egy we partitioned the space into each of {2^, 2^,..., 2^^} equally sized 
intervals, for the kernel strategy we let the radius $r'_{k,\ell}$ take the values
$r'_{k,\ell} \in \{.001, .005, .01, .05, .1, .5, 1, 5, 10, 50\}$ and for the NN strategy we
set $\bar{\ell} = \ell$. Furthermore, we fixed the probability distribution $\{q_{k,\ell}\}$ as the
uniform distribution over the $K \times L$ experts. For the Gaussian process
method, we simply let $K = 5$ and fixed the probability distribution $\{q_k\}$
as the uniform distribution over the $K$ experts.

To compare standard methods with the present nonparametric strategies,
the ARMA(p, q) algorithm was run for all pairs $(p, q) \in \{0, 1, 2, 3, 4, 5\}^2$.
The ARMA family of methods is a combination of an autoregressive part
AR(p) and a moving-average part MA(q). Tables 1 and 2 show the histogram,
kernel, NN, Gaussian process and ARMA results for the unemployment and
interest rate time series respectively. The three ARMA results shown in each
table are those which had the best $L_m$, $L_{50}$ and $A_{50}$ respectively (sometimes
two or more had the same $A_{50}$, in which case we chose one of these randomly).
The best results with respect to each of the three criteria are shown in bold.





Method         L_m      L_50     A_50
histogram      15.66    4.82     68
kernel         15.44    4.99     68
NN             15.40    4.97     70
Gaussian       16.35    5.02     76
ARMA(1, 1)     16.26    5.31     72
ARMA(0, 0)     16.68    4.86     78
ARMA(2, 0)     16.46    5.12     78

Table 1: Results for histogram, kernel, NN, Gaussian process and ARMA
prediction methods on the monthly percentage change in USA unemployment rate
from January 1948 until March 2007. The three ARMA results are those which
performed the best in terms of the $L_m$, $L_{50}$ and $A_{50}$ criteria respectively.

We see via Tables 1 and 2 that the histogram, kernel and NN strategies
presented here outperform all 36 possible ARMA(p, q) models ($0 \le p, q \le 5$)
in terms of normalized cumulative prediction error $L_m$, and that the Gaussian
process method performs similarly to the best ARMA method. In terms of
the $L_{50}$ and $A_{50}$ criteria, all of the present methods and the best ARMA
method provide broadly similar results. From a practical point of view,
we note that the histogram, kernel and NN methods also run much
faster than a single ARMA(p, q) trial on a standard desktop computer. For
example, the NN method is of the order of 10 to 100 times faster than an
ARMA(p, q) for a time series with about 1000 points, depending on the values
of p and q.

Method         L_m      L_50     A_50
histogram      9.78     0.52     88
kernel         9.77     0.57     86
NN             9.86     0.79     80
Gaussian       9.98     0.62     82
ARMA(1, 1)     9.90     0.78     70
ARMA(0, 1)     10.30    0.60     82
ARMA(3, 0)     10.12    0.63     88

Table 2: Results for histogram, kernel, NN, Gaussian process and ARMA
prediction methods on the daily percentage change in the USA federal funds interest rate
from 12 January 2003 until 21 March 2007. The three ARMA results are those
which performed the best in terms of the $L_m$, $L_{50}$ and $A_{50}$ criteria respectively.



4 Proofs 



4.1 Proof of Theorem 2.2

The proof of Theorem 2.2 strongly relies on the following two lemmas. The
first one is known as Breiman's generalized ergodic theorem.

Lemma 4.1 (Breiman [4]) Let $Z = \{Z_n\}_{-\infty}^{\infty}$ be a stationary and ergodic
process. For each positive integer $t$, let $T^t$ denote the left shift operator,
shifting any sequence $\{\ldots, z_{-1}, z_0, z_1, \ldots\}$ by $t$ digits to the left. Let $\{f_t\}_{t=1}^{\infty}$
be a sequence of real-valued functions such that $\lim_{t \to \infty} f_t(Z) = f(Z)$ almost
surely for some function $f$. Suppose that $E \sup_t |f_t(Z)| < \infty$. Then

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} f_t(T^t Z) = E\{f(Z)\} \quad \text{almost surely.}$$



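Lemma 4.2 (Gyorfi and Ottucsak [20]) Let $h^{(1)}, h^{(2)}, \ldots$ be a sequence
of prediction strategies (experts). Let $\{q_k\}$ be a probability distribution on the
set of positive integers. Denote the normalized loss of any expert $h = \{h_n\}_{n=1}^{\infty}$
by

$$L_n(h) = \frac{1}{n} \sum_{t=1}^{n} L(h_t, Y_t),$$

where the loss function $L$ is convex in its first argument $h$. Define

$$w_{k,n} = q_k\, e^{-\eta_n (n-1) L_{n-1}(h^{(k)})},$$

where $\eta_n > 0$ is monotonically decreasing, and set

$$p_{k,n} = \frac{w_{k,n}}{\sum_{i=1}^{\infty} w_{i,n}} .$$

If the prediction strategy $g = \{g_n\}_{n=1}^{\infty}$ is defined by

$$g_n = \sum_{k=1}^{\infty} p_{k,n}\, h_n^{(k)}, \quad n = 1, 2, \ldots$$

then, for every $n \ge 1$,

$$L_n(g) \le \inf_k \left( L_n(h^{(k)}) - \frac{\ln q_k}{n\, \eta_{n+1}} \right) + \frac{1}{2n} \sum_{t=1}^{n} \eta_t \sum_{k=1}^{\infty} p_{k,t}\, L^2(h_t^{(k)}, Y_t).$$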

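Proof of Theorem 2.2. Because of (1) it is enough to show that

$$\limsup_{n \to \infty} L_n(g) \le L^* \quad \text{almost surely.}$$

With this in mind, we introduce the following notation:

$$\hat{E}_n^{(k,\ell)}(x_1^n, y_1^{n-1}, z, s) = \frac{\sum_{\{k < t < n :\ \|x_{t-k}^t - z\| \le r_{k,\ell},\ \|y_{t-k}^{t-1} - s\| \le r'_{k,\ell}\}} y_t}{\left| \{k < t < n :\ \|x_{t-k}^t - z\| \le r_{k,\ell},\ \|y_{t-k}^{t-1} - s\| \le r'_{k,\ell}\} \right|}$$

for all $n > k + 1$, where $0/0$ is defined to be $0$, $z \in (\mathbb{R}^d)^{k+1}$ and $s \in \mathbb{R}^k$. Thus,
for any $h^{(k,\ell)}$, we can write

$$h_n^{(k,\ell)}(x_1^n, y_1^{n-1}) = T_{\min\{n^\delta, \ell\}} \left( \hat{E}_n^{(k,\ell)}(x_1^n, y_1^{n-1}, x_{n-k}^n, y_{n-k}^{n-1}) \right).$$

By a double application of the ergodic theorem, as $n \to \infty$, almost surely,
for a fixed $z \in (\mathbb{R}^d)^{k+1}$ and $s \in \mathbb{R}^k$, we may write

$$\hat{E}_n^{(k,\ell)}(X_1^n, Y_1^{n-1}, z, s) = \frac{\frac{1}{n} \sum_{\{k < t < n :\ \|X_{t-k}^t - z\| \le r_{k,\ell},\ \|Y_{t-k}^{t-1} - s\| \le r'_{k,\ell}\}} Y_t}{\frac{1}{n} \left| \{k < t < n :\ \|X_{t-k}^t - z\| \le r_{k,\ell},\ \|Y_{t-k}^{t-1} - s\| \le r'_{k,\ell}\} \right|}$$

$$\to \frac{E\{ Y_0\, 1_{\{\|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell}\}} \}}{P\{ \|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell} \}}$$

$$= E\{ Y_0 \mid \|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell} \}.$$

Therefore, for all $z$ and $s$,

$$\lim_{n \to \infty} T_{\min\{n^\delta, \ell\}} \left( \hat{E}_n^{(k,\ell)}(X_1^n, Y_1^{n-1}, z, s) \right) = T_\ell \left( E\{ Y_0 \mid \|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell} \} \right) \stackrel{\text{def}}{=} \varphi_{k,\ell}(z, s).$$

Thus, by Lemma 4.1, as $n \to \infty$, almost surely,

$$L_n(h^{(k,\ell)}) = \frac{1}{n} \sum_{t=1}^{n} \left( h_t^{(k,\ell)}(X_1^t, Y_1^{t-1}) - Y_t \right)^2$$

$$= \frac{1}{n} \sum_{t=1}^{n} \left( T_{\min\{t^\delta, \ell\}} \left( \hat{E}_t^{(k,\ell)}(X_1^t, Y_1^{t-1}, X_{t-k}^t, Y_{t-k}^{t-1}) \right) - Y_t \right)^2$$

$$\to E\left\{ \left( \varphi_{k,\ell}(X_{-k}^0, Y_{-k}^{-1}) - Y_0 \right)^2 \right\} \stackrel{\text{def}}{=} \varepsilon_{k,\ell}.$$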

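Denote, for Borel sets $A \subset (\mathbb{R}^d)^{k+1}$ and $B \subset \mathbb{R}^k$,

$$\mu_k(A, B) = P\{ X_{-k}^0 \in A,\ Y_{-k}^{-1} \in B \},$$

and set

$$\psi_k(z, s) \stackrel{\text{def}}{=} E\{ Y_0 \mid X_{-k}^0 = z,\ Y_{-k}^{-1} = s \}.$$

Next, let $S_{s,r}$ denote the closed ball with center $s$ and radius $r$. Let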

then for any z and s which are in the support of /i^, we have 
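$$\varphi_{k,\ell}(z, s) = T_\ell \left( E\{ Y_0 \mid \|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell} \} \right)$$

$$= T_\ell \left( \frac{E\{ Y_0\, 1_{\{\|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell}\}} \}}{P\{ \|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell} \}} \right)$$

$$= T_\ell \left( \frac{1}{\mu_k(S_{z, r_{k,\ell}}, S_{s, r'_{k,\ell}})} \int_{x \in S_{z, r_{k,\ell}},\ y \in S_{s, r'_{k,\ell}}} \psi_k(x, y)\, \mu_k(dx, dy) \right)$$

$$\to \psi_k(z, s),$$

as $\ell \to \infty$ and for $\mu_k$-almost all $s$ and $z$ by the Lebesgue density theorem
(see Gyorfi, Kohler, Krzyzak and Walk [17], Lemma 24.5). Therefore,

$$\lim_{\ell \to \infty} \varphi_{k,\ell}(X_{-k}^0, Y_{-k}^{-1}) = \psi_k(X_{-k}^0, Y_{-k}^{-1}) \quad \text{almost surely.}$$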

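Observe that

$$\varphi_{k,\ell}^2(z, s) = \left[ T_\ell \left( E\{ Y_0 \mid \|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell} \} \right) \right]^2$$

$$\le \left( E\{ Y_0 \mid \|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell} \} \right)^2$$

(since $|T_\ell(z)| \le |z|$)

$$\le E\{ Y_0^2 \mid \|X_{-k}^0 - z\| \le r_{k,\ell},\ \|Y_{-k}^{-1} - s\| \le r'_{k,\ell} \}$$

(by Jensen's inequality).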
Consequently,

e>i 

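due to the assumptions of the theorem. Therefore, for fixed $k$ the sequence of
random variables $\{\varphi_{k,\ell}(X_{-k}^0, Y_{-k}^{-1})\}_{\ell=1}^{\infty}$ is uniformly integrable and by using
the dominated convergence theorem we obtain

$$\lim_{\ell \to \infty} \varepsilon_{k,\ell} = \lim_{\ell \to \infty} E\left\{ \left( \varphi_{k,\ell}(X_{-k}^0, Y_{-k}^{-1}) - Y_0 \right)^2 \right\}$$

$$= E\left\{ \left( E\{ Y_0 \mid X_{-k}^0, Y_{-k}^{-1} \} - Y_0 \right)^2 \right\} \stackrel{\text{def}}{=} \varepsilon_k.$$

Invoking the martingale convergence theorem (see, e.g., Stout [28]), we then
have

$$\lim_{k \to \infty} \varepsilon_k = E\left\{ \left( E\{ Y_0 \mid X_{-\infty}^0, Y_{-\infty}^{-1} \} - Y_0 \right)^2 \right\} = L^*,$$

and consequently,

$$\lim_{k, \ell \to \infty} \varepsilon_{k,\ell} = L^*.$$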

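We next apply Lemma 4.2 with the choice $\eta_n = 1/\sqrt{n}$ and the squared
loss

$$L(h_t, Y_t) = (h_t - Y_t)^2.$$

We obtain

$$L_n(g) \le \inf_{k,\ell} \left( L_n(h^{(k,\ell)}) - \frac{2 \ln q_{k,\ell}}{\sqrt{n}} \right) + \frac{1}{2n} \sum_{t=1}^{n} \frac{1}{\sqrt{t}} \sum_{k,\ell=1}^{\infty} p_{k,\ell,t} \left( h_t^{(k,\ell)}(X_1^t, Y_1^{t-1}) - Y_t \right)^4 .$$
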
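On one hand, almost surely,

$$\limsup_{n \to \infty} \inf_{k,\ell} \left( L_n(h^{(k,\ell)}) - \frac{2 \ln q_{k,\ell}}{\sqrt{n}} \right) \le \inf_{k,\ell} \limsup_{n \to \infty} \left( L_n(h^{(k,\ell)}) - \frac{2 \ln q_{k,\ell}}{\sqrt{n}} \right)$$

$$= \inf_{k,\ell} \limsup_{n \to \infty} L_n(h^{(k,\ell)}) = \inf_{k,\ell} \varepsilon_{k,\ell} \le \lim_{k,\ell \to \infty} \varepsilon_{k,\ell} = L^*.$$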
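
On the other hand,

$$\frac{1}{n} \sum_{t=1}^{n} \frac{1}{\sqrt{t}} \sum_{k,\ell=1}^{\infty} p_{k,\ell,t} \left( h_t^{(k,\ell)}(X_1^t, Y_1^{t-1}) - Y_t \right)^4$$

$$\le \frac{8}{n} \sum_{t=1}^{n} \frac{1}{\sqrt{t}} \sum_{k,\ell=1}^{\infty} p_{k,\ell,t} \left( h_t^{(k,\ell)}(X_1^t, Y_1^{t-1})^4 + Y_t^4 \right)$$

$$\le \frac{8}{n} \sum_{t=1}^{n} \frac{t^{4\delta} + Y_t^4}{\sqrt{t}} .$$

Therefore, almost surely,

$$\limsup_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \frac{1}{\sqrt{t}} \sum_{k,\ell=1}^{\infty} p_{k,\ell,t} \left( h_t^{(k,\ell)}(X_1^t, Y_1^{t-1}) - Y_t \right)^4 \le \limsup_{n \to \infty} \frac{8}{n} \sum_{t=1}^{n} \frac{t^{4\delta} + Y_t^4}{\sqrt{t}} = 0$$

(since $\delta < 1/8$ and $E\{Y_0^4\} < \infty$).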
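One way to see why both conditions make this limit vanish is sketched below; it is a filled-in step under the stated assumptions, not a quotation of the original argument. The deterministic part uses $4\delta < 1/2$, and the random part uses the ergodic theorem together with an arbitrary cut-off $m$.

```latex
% Deterministic part: 4\delta - 1/2 < 0, so t^{4\delta - 1/2} is decreasing in t.
\frac{1}{n}\sum_{t=1}^{n} t^{4\delta - 1/2}
  \le \frac{1}{n}\left(1 + \int_{1}^{n} s^{4\delta - 1/2}\,ds\right)
  \le \frac{1}{n} + \frac{n^{4\delta - 1/2}}{4\delta + 1/2}
  \xrightarrow[n\to\infty]{} 0 .

% Random part: split the sum at a fixed index m and bound 1/\sqrt{t} \le 1/\sqrt{m} for t > m.
\frac{1}{n}\sum_{t=1}^{n} \frac{Y_t^4}{\sqrt{t}}
  \le \frac{1}{n}\sum_{t=1}^{m} Y_t^4
      + \frac{1}{\sqrt{m}}\cdot\frac{1}{n}\sum_{t=1}^{n} Y_t^4
  \xrightarrow[n\to\infty]{} \frac{E\{Y_0^4\}}{\sqrt{m}}
  \quad\text{almost surely,}
% by the ergodic theorem, since E\{Y_0^4\} < \infty. Letting m \to \infty gives limit 0.
```

The first display is where $\delta < 1/8$ is needed; the second is where the finite fourth moment enters.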
Summarizing these bounds, we get that, almost surely,

$$\limsup_{n \to \infty} L_n(g) \le L^*,$$

and the theorem is proved. □



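4.2 Sketch of the proof of Theorem 2.4

For fixed $k$ and $\ell$, let

$$(c_1^*, \ldots, c_\ell^*) \in \arg\min_{(c_1, \ldots, c_\ell)} E\left\{ \left( \sum_{j=1}^{\ell} c_j\, \phi_j^{(k)}(X_{-k}^0, Y_{-k}^{-1}) - Y_0 \right)^2 \right\}.$$

Then, following the proof of Theorem 2 in [18], one can show that for all
$j \in \{1, \ldots, \ell\}$,

$$\lim_{n \to \infty} c_{n,j} = c_j^* \quad \text{almost surely,} \tag{9}$$

where the $c_{n,j}$ are the minimizers of the least-squares criterion of Section 2.4.
Using equality (9) and Lemma 4.1, for any
fixed $k$ and $\ell$ we obtain that, almost surely,

$$\lim_{n \to \infty} L_n(h^{(k,\ell)}) = \lim_{n \to \infty} \frac{1}{n} \sum_{t=k+1}^{n} \left( h_t^{(k,\ell)}(X_1^t, Y_1^{t-1}) - Y_t \right)^2$$

$$= \lim_{n \to \infty} \frac{1}{n} \sum_{t=k+1}^{n} \left( T_{\min\{t^\delta, \ell\}} \left( \sum_{j=1}^{\ell} c_{t,j}\, \phi_j^{(k)}(X_{t-k}^t, Y_{t-k}^{t-1}) \right) - Y_t \right)^2$$

$$= E\left\{ \left( T_\ell \left( \sum_{j=1}^{\ell} c_j^*\, \phi_j^{(k)}(X_{-k}^0, Y_{-k}^{-1}) \right) - Y_0 \right)^2 \right\} \stackrel{\text{def}}{=} \varepsilon_{k,\ell}.$$

Then, with similar arguments to Theorem 2 in [18], it can be shown that

$$\lim_{k,\ell \to \infty} \varepsilon_{k,\ell} \le L^*.$$

Finally, by using Lemma 4.2, the assumptions $\delta < 1/8$ and $E\{Y_0^4\} < \infty$, and
repeating the arguments of the proof of Theorem 2.2, we obtain

$$\limsup_{n \to \infty} L_n(g) \le \inf_{k,\ell} \varepsilon_{k,\ell} \le L^*,$$

as desired. □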